ABSTRACT

The 1984/85 accomplishments of the research project "High Performance Parallel Computing" included: bringing the prototype of the Texas Reconfigurable Array Computer (TRAC) to a configuration and a state of stability where it could support execution of simple assembly language programs; initial development of a unified model of parallel computation, which is the basis for a programming environment uniting the process and data flow models of parallel computation; bringing to operational status, on an alternative host, one of the two parallel programming languages (the Computation Structures Language, CSL) originally intended for use on TRAC; exploration of the expressive capabilities of this programming language; initiation of development of a graphical programming language based on the unified model of parallel computation; major progress on a graphically interfaced, Petri Net-based performance modeling system for parallel computations; and development of algorithms for scheduling circuits to realize configurations in computer architectures based on configurable banyan networks.

The  TRAC  configuration  which  became  available  for  use  during  this  period  is  a  four- 
processor  nine-memory  system  with  two  10  MBYTE  Winchester  disks.  The  VAX 
11/750  obtained  under  the  DoD  Research  Equipment  program  is  used  as  both  front- 
and  back-end  for  TRAC.  This  configuration  is  not  able  to  support  development  of 
major  software  systems  but  can  be  used  for  small  scale  experiments  in  reconfigurable 
computation.  Several  such  experiments  were  conducted  including  one  which  attached 
an  image  input  device  to  the  auxiliary  resource  interface  developed  in  1983  and  reported 
in  the  Final  Report  for  Contract  F49620-83-C-0049. 

Substantial existing Fortran programs, up to 4,000 lines of code, have been given parallel structure in the Computation Structures Language (CSL) and used as test programs for the concepts and implementation of CSL.


The computations of the scheduling algorithms developed for selection of circuits to realize configurations are distributed across the nodes of the switch. This distribution and the resulting parallel execution give log(n) execution time for circuit scheduling, where n is the number of base or apex nodes in the network.

During this year it was recognized that the network structure which couples processors and memories in TRAC can be adapted and extended to couple computers to external memory systems, providing a very high performance, highly parallel I/O system.

1.  Research  Objectives 

The research objective of the project "High Performance Parallel Computing" was an
integrated  approach  to  parallel  computation  coupling  the  development  of  a  novel 
reconfigurable  parallel  computer  architecture,  the  Texas  Reconfigurable  Array 
Computer  (TRAC),  with  the  development  of  software  for  the  architecture  and  design  of 
algorithms  which  could  effectively  utilize  the  capabilities  of  the  architecture.  This 
project,  which  we  shall  refer  to  as  the  TRAC  Project,  was  initiated  in  1978.  The 
research  accomplishments  reported  for  1984/85  are  in  context  of  this  substantial  and 
long-lived project. The specific objectives for 1984/85 were to bring the prototype of TRAC to a configuration and a state of stability where it could support development of software systems and applications, to bring into experimental use the software systems designed and partially implemented in previous research, and to explore the effectiveness of the reconfigurable execution environment provided by TRAC on significant algorithms.

A summary of architectural concepts for TRAC provides context for the research accomplishments reported below. The fundamental concept of
TRAC  is  that  an  effective  execution  environment  for  parallel  computations  can  be 
realized  by  having  a  computer  architecture  which  can  be  configured  to  the  arithmetic 
and  communication  requirements  of  the  algorithms  and  applications  it  is  executing. 
Configurability  is  attained  by  having  processor  and  memory  resources  connected  by  a 
banyan  interconnection  network  which  supports  the  establishment  of  circuits  at  runtime 
and which also implements memory to processor packet routing. The processors are placed at the apex of the network and the memories at the base of the network. Figure
1  illustrates  this  concept  with  a  four-processor  nine-memory  configuration  connected  by 
a  switch  with  nodes  having  spread  of  two  and  fan-out  of  three.  A  "computer"  is 
realized  by  establishing  circuits  in  the  network  coupling  processors  to  memories  and 
coupling "computers" to "computers." Figure 2 shows a configuration realizing an
MIMD  configuration  with  each  "computer"  being  made  up  of  two  processors  coupled 
through  the  switch,  and  two  memory  units.  The  two  computers  also  share  a  memory 
unit.  It  is  straightforward  to  see  that  this  procedure  can  be  used  to  construct  a 
spectrum  of  architectures  ranging  across  SIMD  and  MIMD  and  including  both  message- 
based  and  shared-memory-based  configurations.  A  more  extensive  discussion  can  be 
found in Browne and Lipovski [BRO82a].
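
The sizes quoted for Figure 1 follow from the regular structure of the switch. As a minimal sketch, assuming a regular banyan in which every node has the same spread and fan-out (the function name `banyan_level_sizes` and the level numbering are illustrative and not taken from the TRAC design), the number of nodes at each level can be computed as follows:

```python
def banyan_level_sizes(spread, fanout, levels):
    """Node counts per level, from the apex (level 0) to the base (level `levels`).

    Assumption: a regular banyan in which every node has `fanout` links
    toward the base and `spread` links toward the apex, so that level i
    holds spread**(levels - i) * fanout**i nodes.
    """
    return [spread ** (levels - i) * fanout ** i for i in range(levels + 1)]


# Spread two, fan-out three, two levels of links: the Figure 1 configuration.
sizes = banyan_level_sizes(spread=2, fanout=3, levels=2)
print(sizes)  # [4, 6, 9]: four processors at the apex, nine memories at the base
```

Under this assumption, adding a level multiplies the number of processors by the spread and the number of memories by the fan-out, so the depth of the switch grows only logarithmically with the number of resources.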
simple programs.

The  memory  board  which  was  actually  carried  to  completion  for  inclusion  was  a 
redesign  of  the  original  board  which  corrected  timing  errors  and  incorporated  much 
greater  margins  for  stability.  This  redesign  was  carried  out  on  the  VALID  Logic  ECAD 
design  system  purchased  with  the  matching  funds  provided  by  the  University  of  Texas 
in connection with DoD Research Equipment Grant Number AFOSR-83-0315. The addition of this more stable memory board enabled a sustained push of hardware testing, itself made possible by the Tektronix Logic Analyzer purchased under the same grant. This testing brought the four-processor nine-memory configuration of TRAC to a point where it could be, and was, used for execution of demonstration codes and simple applications. The assembler and
loader  which  had  been  hosted  on  the  Research  DEC  2060  of  the  Computer  Science 
Department  were  ported  to  the  TRAC  VAX  to  facilitate  code  development  and  loading. 
The  most  significant  application  attempted  was  coupling  of  an  image  digitizer  to  the 
Auxiliary Resource Interface (ARI) referenced in the Final Report for Grant F49620-83-C-0049. Code for simple image processing applications was developed and run on the
digitized  images. 

2.2.  Algorithms  and  Theory 

A fundamental problem for a computer architecture built on a configurable network is the development of algorithms for selection of the processors, memories and switch resources with which to establish specific computer configurations. The problem
arises  because  the  Banyan  interconnection  network  is  a  blocking  network  which  cannot 
simultaneously  realize  circuits  giving  full  connectivity  between  all  processors  and  all 
memories.  Effective  use  of  the  processor  and  memory  resources  thus  requires  effective 
algorithms  for  construction  of  configurations.  In  addition  to  yielding  efficient  use  of 
resources,  the  algorithm  must  be  sufficiently  efficient  to  be  frequently  executed  during 
the  course  of  execution  of  a  computation.  It  must  be  applicable  to  partial 
configurations  of  resources  since  the  total  system  resources  may  be  shared  by  several 
jobs. Feo [FEO85] has developed and evaluated by simulation a spectrum of algorithms. All have the common property that the computations of the algorithm are
distributed  across  the  nodes  of  the  network.  All  function  by  broadcast  of  signals 
through  the  network,  first  from  apex  nodes  to  base  nodes  and  then  back  to  the  apex 
nodes. Each pass generates state information about the number of paths through the network which will be consumed by selection of a given resource configuration. There exists an algorithm which can be shown to minimize loss of connectivity in the network [BIT84], but it is rather complex. The results of Feo’s research show that simple algorithms, whose functionality can be implemented by adding fewer than 150 gates to the switch nodes, give near-optimal results over a wide range of network states with log(n) computation time, where n is proportional to the number of apex and base nodes.
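
To make the blocking property concrete, the following sketch labels the nodes of a regular banyan with digit tuples so that each processor-memory pair has exactly one path, and then tests whether two circuits need the same link. The labelling scheme, the function names and the rule that a link carries at most one circuit are illustrative assumptions; this is not the TRAC switch design and not one of Feo's scheduling algorithms.

```python
def path_links(apex, base):
    """Links on the unique path from an apex node to a base node.

    Assumed labelling: an apex node is a tuple of L digits (each below the
    spread) and a base node is a tuple of L digits (each below the fan-out).
    The node reached after i hops keeps apex digits i..L-1 and has fixed
    base digits 0..i-1, so a path has L links, i.e. a length of order log(n).
    """
    levels = len(apex)
    if len(base) != levels:
        raise ValueError("apex and base labels must have the same length")
    # Link i joins level i to level i + 1; it is identified by the apex
    # digits still kept and the base digits fixed so far.
    return {(i, apex[i:] + base[:i + 1]) for i in range(levels)}


def circuits_conflict(c1, c2):
    """True if two (apex, base) circuits would need a common link."""
    return bool(path_links(*c1) & path_links(*c2))


# Three-level example with spread 2 and fan-out 3 (8 apex, 27 base nodes).
first = ((0, 0, 0), (0, 0, 0))
second = ((1, 0, 0), (0, 0, 1))   # different processor, different memory
third = ((0, 0, 1), (1, 0, 0))

print(circuits_conflict(first, second))  # True: both need the same middle link
print(circuits_conflict(first, third))   # False: the two paths are disjoint
```

The first pair shows why configuration selection matters: two circuits between distinct processors and distinct memories can still exclude each other, so the order and choice of resources determine how much connectivity remains for later requests.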

There  are  a  great  number  of  models  of  parallel  computation.  Some  of  the  more 
popular  are  the  process  model,  the  several  flavors  of  data  flow  and  the 
functional/applicative model. In the past, each of these models of computation has spawned its own programming languages and architectures. Browne [BRO85a, BRO85b] has developed a unified model of parallel computation which
integrates  all  of  the  common  models  of  parallel  computation  at  a  somewhat  abstract 
level.  The  essential  concept  is  to  generalize  dependency  relations  and  to  formulate 
computations  as  graph  structures  where  the  nodes  are  schedulable  units  of  computation 
and the arcs are generalized dependency relations. The model allows a single computation graph to include several different types of dependency relations, and it leads naturally to the concept of a graphical programming language. This paradigm for graphical programming is described briefly in the following section.
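
The graph formulation can be illustrated with a small data structure. This is a minimal sketch only: the class and method names are hypothetical, the two arc kinds ("data" and "constraint") are borrowed from the Figure 5 notes, and treating every arc as a simple ordering requirement is a simplifying assumption rather than the semantics defined in [BRO85a, BRO85b].

```python
from collections import defaultdict


class ComputationGraph:
    """Nodes are schedulable units; each arc carries a dependency kind."""

    def __init__(self):
        self.nodes = set()
        self.incoming = defaultdict(list)   # target -> [(source, kind), ...]

    def add_arc(self, source, target, kind):
        self.nodes.update((source, target))
        self.incoming[target].append((source, kind))

    def kinds(self):
        """All dependency kinds that appear in this one graph."""
        return {kind for arcs in self.incoming.values() for _, kind in arcs}

    def ready(self, finished):
        """Units not yet run whose predecessors, of any kind, have finished."""
        return sorted(
            n for n in self.nodes - set(finished)
            if all(src in finished for src, _ in self.incoming[n])
        )


g = ComputationGraph()
g.add_arc("A", "B", "data")
g.add_arc("A", "C", "data")
g.add_arc("B", "D", "constraint")
g.add_arc("C", "D", "data")

print(g.ready(set()))        # ['A']
print(g.ready({"A"}))        # ['B', 'C']: both may be scheduled in parallel
```

A scheduler built on such a structure needs no knowledge of whether the program was conceived in a process or a data flow style; it only inspects the arcs.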

2.3.  Parallel  Programming  System 

Implementation of the Computation Structures Language (CSL) [BRO82b] on the dual CDC Cyber 170/750 configuration of the University of Texas Computation Center was completed during 1984/85. The dual Cybers were chosen as the host for implementation of CSL because the available configuration of TRAC would not support the runtime system of CSL. The four-processor nine-memory configuration has a total of 576 KBYTES of memory, its interrupt handling capabilities had not been implemented, and there remained problems of intermittent hardware failures. These circumstances made it impractical to attempt implementation of the operating system and programming language systems designed for TRAC. The
Computation  Center  of  the  University  of  Texas  added  to  the  UT-2D  operating  system 
the necessary primitives for implementation of generalized dependency relations. A number of existing Fortran programs of substantial size have been given
parallel  structure  at  the  module  level  and  executed  under  control  of  the  CSL  runtime 
system.  The  largest  of  these  programs  is  a  4,000  line  molecular  integrals  code. 

The  CSL  runtime  system  was  also  used  as  a  vehicle  for  the  study  of  parallel 
structuring  of  resource  management  systems  [ODE85].  The  result  was  a  relationship 
defining  the  degree  of  parallelism  necessary  for  the  CSL  runtime  system  to  execute  a 
given  workload. 

We  have  also  used  CSL  as  a  vehicle  for  study  of  the  amount  of  parallelism  in  an 
algorithm  which  can  be  readily  captured  in  programming  languages.  CSL  is  a  language 
for  representing  dependency  graphs  at  the  task  level  and  for  expressing  a  traversal  of 
the  graph  to  execute  the  computation.  CSL  incorporates  both  message  and  shared 
memory  models  for  implementation  of  dependency  relations.  It  was  found  in  several 
studies  that  the  use  of  both  types  of  dependency  relations  aided  in  obtaining  compact 
parallel  computation  structures  which  yield  most  of  the  potential  parallelism  in  an 
algorithm.  Appendix  A  develops  one  example  CSL  program  from  a  specification  of  an 
algorithm  for  solution  of  a  set  of  linear  equations  with  a  triangular  coefficient  matrix. 

We  also  initiated  during  this  period  implementation  of  two  other  programming 
systems  for  parallel  computation.  The  Task  Level  Data  Flow  Language  (TDFL)  uses  an 
explicit  specification  of  a  dependency  graph  in  terms  of  a  subset  of  the  generalized 
dependency relations defined by Browne [BRO85b] but leaves the traversal of the graph to be determined dynamically at runtime. A prototype implementation of TDFL has been accomplished since the conclusion of the period of this grant. We also have begun
design  and  implementation  of  a  graphical  programming  interface  in  which  a 
programmer  specifies  a  parallel  program  directly  as  a  dependency  graph  where  the 
nodes  of  the  graph  are  pre-programmed  schedulable  units  of  computation  and  the 
properties  of  the  generalized  dependency  graph  are  declaratively  specified. 

2.4.  Performance  Modeling  of  Parallel  Computations 

There  are  always  many  possible  parallel  structures  for  a  given  computation.  The 
factors  which  can  be  varied  include  granularity  of  the  schedulable  units  of  computation 
and  choice  of  types  of  dependency  relation.  It  would  be  laborious  to  systematically 
explore  the  parameter  space  for  complex  computation  structures  with  explicit 
programming.  Adiga  [ADI86]  has  been  developing  a  performance  modeling  system  for 
parallel  computation  structures.  The  conceptual  basis  of  the  modeling  system  is  an 
extended  form  of  Petri  Nets.  The  extended  Petri  Nets  are  expressed  in  terms  of  vector 
replacement  systems,  a  generalization  of  vector  addition  systems  developed  by  Kapur 
[KAP82].  The  performance  modeling  system  uses  a  graphical  interface  for  model 
specification.  The  modeling  system  has  been  implemented  to  have  constructs  which  are 
direct  equivalents  of  CSL  programs.  A  CSL  program  can  be  directly  mapped  to  an 
extended  Petri  Net  model  and  vice  versa.  A  critical  aspect  of  any  performance 
modeling  system  is  validation.  The  ability  to  transform  back  and  forth  from  CSL 
facilitates  validation.  A  study  of  the  design  space  for  parallel  computations  is  executed 
by  writing  a  CSL  program  for  one  parallel  structure  and  measuring  its  execution 
properties.  The  CSL  program  determines  the  structure  of  the  model.  The  measurement 
data  can  then  be  used  to  parameterize  and  validate  the  extended  Petri  Net  performance 
model.  The  model  can  then  be  used  to  search  for  "optimal"  parallel  structures  with 
increased  confidence.  The  design  of  this  performance  modeling  system  was  done  under 
sponsorship  of  this  grant. 
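
As a rough illustration of the underlying formalism, a plain Petri net can be simulated with vectors: a marking is a vector of token counts, and firing a transition subtracts one vector and adds another whenever enough tokens are present. The sketch below covers only this basic case, not the vector replacement systems of [KAP82] or Adiga's modeling system; the place and transition names are hypothetical.

```python
def enabled(marking, consume):
    """A transition is enabled if every place holds the tokens it consumes."""
    return all(m >= c for m, c in zip(marking, consume))


def fire(marking, consume, produce):
    """Return the marking reached by firing one enabled transition."""
    if not enabled(marking, consume):
        raise ValueError("transition is not enabled")
    return [m - c + p for m, c, p in zip(marking, consume, produce)]


def run(marking, transitions, max_steps=100):
    """Repeatedly fire the first enabled transition; return (trace, marking)."""
    trace = []
    for _ in range(max_steps):
        for name, (consume, produce) in transitions.items():
            if enabled(marking, consume):
                marking = fire(marking, consume, produce)
                trace.append(name)
                break
        else:
            break   # no transition is enabled: the net is dead
    return trace, marking


# Places: [solve_ready, matvect_ready, done]
transitions = {
    "SOLVE":   ([1, 0, 0], [0, 2, 0]),   # one solve releases two matvect tasks
    "MATVECT": ([0, 1, 0], [0, 0, 1]),
}
trace, final = run([1, 0, 0], transitions)
print(trace)   # ['SOLVE', 'MATVECT', 'MATVECT']
print(final)   # [0, 0, 2]
```

A performance model attaches timing information to such transitions; measured execution data from the corresponding CSL program would supply those parameters.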

2.5. Parallel I/O Architectures

The  network  architecture  concepts  of  TRAC  have  been  extended  and  adapted  to 
generate  a  highly  parallel  I/O  architecture  which  integrates  data  base  operations, 
efficient virtual memory operation and object-oriented data handling [BRO85e].


The  basis  of  the  proposed  I/O  architecture  is  to  connect  the  external  memories  of  the 
multiprocessor  architecture  to  the  computation  elements  via  a  network  switch.  The 
nodes  of  the  switch  are  given  the  functionality  to  execute  routing  and  sort/merge 
operations  on  data  elements  as  they  traverse  the  switch  between  the  processors  and 
external  memories.  The  routing  operations  can  be  used  to  implement  indexed  store 
operations  in  the  processor  to  memory  direction  and  data  distribution  and  sort/merge 
operations  in  the  memory  to  processor  direction.  The  external  memory  systems  are 
interfaced to an object-oriented self-managing secondary memory (SMSM) [RAT84]
which  is  in  turn  interfaced  to  the  network  by  a  specialized  processor  which  implements 
data  filtering  operations.  The  SMSM’s  are  composed  of  cells  which  can  execute  search 
operations  in  parallel  and  which  also  execute  memory  management  functions  such  as 
allocation  and  garbage  collection.  The  SMSM’s  are  short  latency  object-oriented  caches 
for  the  mass  storage  devices.  Figure  3  is  a  schematic  of  this  architecture. 
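
The sort/merge function of the switch nodes in the memory to processor direction can be pictured as a tree of merge steps. The sketch below is an assumption-laden software analogy, not the proposed hardware: each hypothetical `merge_node` combines already-sorted streams arriving from the level below, and `merge_tree` stacks such nodes until one stream remains.

```python
import heapq


def merge_node(*sorted_inputs):
    """One switch node: lazily merge already-sorted streams from below."""
    return heapq.merge(*sorted_inputs)


def merge_tree(streams, fanout=2):
    """Merge sorted streams level by level, `fanout` inputs per node."""
    level = [iter(s) for s in streams]
    while len(level) > 1:
        level = [merge_node(*level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return list(level[0]) if level else []


# Four external memories, each delivering a sorted run of keys.
runs = [[1, 5, 9], [2, 6], [3, 4, 10], [0, 7, 8]]
print(merge_tree(runs))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because every node only compares the heads of its input streams, the data arrive at the processor already merged, and the number of merge stages grows with the depth of the switch rather than with the number of memories.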

This  architecture  clearly  has  the  potential  for  a  great  deal  of  parallelism  in  the 
operation  of  I/O  and  data  base  operations.  Multiple  activities  can  be  initiated  at  the 
processor  level.  Indexing  operations  are  executed  in  parallel  by  the  network  switches. 
The  SMSM’s  can  execute  searches  in  parallel  and  each  SMSM  can  have  internal  search 
parallelism.  Research  to  explore  the  potential  of  this  I/O  architecture  was  initiated 
under  the  partial  sponsorship  of  this  grant. 
Appendix A

AN  EXAMPLE  COMPUTATION  STRUCTURE  AND  PROGRAM 

This  section  illustrates  the  representation  concepts  described  in  the  previous  section. 
The  example  is  a  problem  used  by  J.  Dongarra  and  D.  Sorensen  of  Argonne  National 
Laboratories to test the expressive power of parallel programming languages. The computation is the solution of a triangular system of equations

Tx  =  b 

Dongarra  and  Sorensen  are  evaluating  the  ease  with  which  all  of  the  potential 
parallelism  in  the  computation  can  be  obtained  in  the  programming  system.  The 
method  used  is  to  decompose  the  matrix  into  blocks  as  illustrated  in  Figure  4.  The 
steps  in  the  solution  are 

a. solve the triangular diagonal blocks

b. execute the transformation

b_j - T_ji x_i -> b_j

on all of the blocks in column i

The schedulable units of computation are:

a. INITARRAY - an initializing routine

b.  SOLVE  -  a  block  triangular  solver 

c.  MATVECT  -  multiplies  a  matrix  times  a  vector  and  subtracts  from  another 
vector  in  place 
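
The two solution steps and the schedulable units listed above can be sketched sequentially as follows. This is an illustrative Python rendering, not the Fortran code supplied by Dongarra and Sorensen; the function names mirror SOLVE and MATVECT, and the nested-list block layout (only the blocks on and below the diagonal are used) is an assumption.

```python
def solve(T, b):
    """SOLVE: forward substitution on one lower-triangular diagonal block."""
    x = []
    for r, row in enumerate(T):
        partial = sum(row[c] * x[c] for c in range(r))
        x.append((b[r] - partial) / row[r])
    return x


def matvect(T, x, b):
    """MATVECT: return b - T x for one off-diagonal block."""
    return [b[r] - sum(T[r][c] * x[c] for c in range(len(x)))
            for r in range(len(b))]


def block_triangular_solve(Tb, bb):
    """Solve T x = b where Tb[j][i] is block (j, i) and bb[j] is block j of b."""
    xb = [list(v) for v in bb]
    for i in range(len(xb)):
        xb[i] = solve(Tb[i][i], xb[i])               # step a, block column i
        for j in range(i + 1, len(xb)):              # step b, rest of column i
            xb[j] = matvect(Tb[j][i], xb[i], xb[j])
    return xb


# A 4x4 lower-triangular system split into 2x2 blocks.
Tb = [
    [[[2, 0], [1, 1]]],
    [[[3, 2], [1, 0]], [[4, 0], [1, 2]]],
]
bb = [[2, 3], [11, 8]]
print(block_triangular_solve(Tb, bb))  # [[1.0, 2.0], [1.0, 3.0]]
```

The inner loop over j is where the parallelism lies: once block i of the solution is known, all MATVECT updates in column i are independent of one another.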

The  dependency  relations  are  all  of  type  data.  The  granularity  is  that  of  computation 
data  structures.  We  choose  a  data  driven  protocol.  Figure  5  gives  an  explicit 
computation  graph  for  this  computation  for  NBLOCKS=4  and  Figure  6  is  the  same 
graph  in  a  proposed  indexed  notation.  Figure  7  is  a  CSL  program  where  the 
CONSTRUCT  section  of  the  program  gives  a  text  string  representation  of  the 
dependency  graph  and  the  executable  portion  of  the  program  specifies  a  traversal  of  the 
graph. 
This computation graph shows the dependency relations as data dependencies (single headed arrows) and constraint dependencies (double headed arrows). The dependencies on b_i disappear since SOLVE and MATVECT replace b_i by x_i as the solution progresses (solution in place). The T_ij are assumed stored before initiation of execution. Note that all SOLVE's can be simultaneously initiated since they have no input data dependencies. Note that if MATVECT were broken up into

T_ij x_j -> g_ij and b_i - g_ij -> b_i

then even more parallelism might be realized but with a more complex graph. SOLVE(I,I) is SOLVE(I) of the CSL program. This is a tradeoff of parallelism for storage.


Figure  5  -  A  Computation  Graph  for  Parallel  Structuring 
of  the  Block  Triangular  Solution 


VAR NBLOCKS: INTEGER;
END;

BEGIN

NBLOCKS := 6;

CONSTRUCT

TASKS

INITARRAY : C2 [T(I,J), X(I)] RANGE I = 1 TO NBLOCKS, J = 1 TO I;

(* This sets up the values of array T and vector X *)

SOLVE (I) : C1 [T(I,I), X(I)] RANGE I = 1 TO NBLOCKS;

(* The code for the diagonal tasks is identical.
A task is declared for each such block, and they are
indexed by row. The compiled code resides in file C1.
Each task accesses the Ith block of vector X,
and the I,Ith block of array T. *)

MATVECT (I,J) : C3 [T(I,J), X(I), X(J)] RANGE I = 1 TO NBLOCKS,
J = 1 TO (I - 1);

(* The nondiagonal tasks are indexed by row and column.
Compiled code is in file C2. Each matvect task
accesses the Ith and Jth blocks of vector X, and the
I,Jth block of array T. *)

CONDITIONS

TC (I,J) : MATVECT (I,J) RANGE I = 1 TO NBLOCKS,
J = 1 TO (I - 1);

(* Each matvect task has one task condition associated with
it, which it can set to communicate with the CSL program.
The convention used here is that the task conditions are
initially false, and each task sets them to true as part
of its execution to signal that it is done. *)

END; (* end construct *)

WITH T[I,J], X[I] RANGE I = 1 TO NBLOCKS, J = 1 TO I
DO EXECUTE INITARRAY; (* initialize X, T *)

( // BEGIN

// WAIT TC(I,J) RANGE J = 1 TO I-1; (* parallel waits, control does
not advance until all are
satisfied. *)

WITH X[I] : T[I,I]
DO EXECUTE SOLVE (I); (* when all matvect tasks finished for
this row, execute diagonal.
The Ith block of X is changed, so it is
accessed read/write. The values in
T[I,I] are used in calculations but not
altered, so it is accessed read-only. *)

// WITH X[J] : X[I], T[J,I]
DO EXECUTE MATVECT (J,I)
RANGE J = (I + 1) TO NBLOCKS;

(* when diagonal task for this column finished, start off
all matvect tasks in this column in parallel.
The J,Ith task reads X[I] and writes X[J] *)

END;

) RANGE I = 1 TO NBLOCKS;

END.

Figure  7  -  CSL  PROGRAM 


This  program  starts  off  NBLOCKS  parallel  streams.  Each  parallel  stream  begins  by 
executing  a  diagonal  task  first,  and  then  all  the  matvect  tasks  in  that  column  in 
parallel.  The  start  of  the  actual  execution  of  each  stream  is  coordinated  by  checking 
the  task  conditions  of  the  tasks  that  must  precede  the  diagonal  task  execution  for  that 
column.  The  first  stream  is  started  immediately,  since  the  range  index  for  the 
conditions varies from 1 to 0. The second stream executes SOLVE(2) as soon as MATVECT(2,1) is completed and has communicated that fact to the CSL program. It doesn’t have to wait for MATVECT(3,1) ... MATVECT(NBLOCKS,1). Since each MATVECT(I,J) puts a lock on the Ith block of X for the duration of its execution, the MATVECTs for each row will have to execute sequentially. This could be remedied by splitting MATVECT into two tasks, the first one reading and computing, and then
transmitting  the  result  to  the  second  task,  which  obtains  the  necessary  write-lock  for  a 
shorter  period  of  time. 
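
The coordination just described can be mimicked with ordinary threads. The following is a sketch under stated assumptions rather than a model of the CSL runtime system: each block is reduced to a single number, `threading.Event` objects stand in for the task conditions TC(I,J), a `threading.Lock` per entry stands in for the write lock on a block of X, and all names are hypothetical.

```python
import threading


def parallel_triangular_solve(T, b):
    """Solve lower-triangular T x = b with one stream (thread) per column."""
    n = len(b)
    x = [float(v) for v in b]          # overwritten in place, like X in Figure 7
    # tc[i][j] stands in for TC(I,J): set once MATVECT(i, j) has finished.
    tc = [[threading.Event() for _ in range(i)] for i in range(n)]
    write_lock = [threading.Lock() for _ in range(n)]

    def matvect(j, i):
        with write_lock[j]:            # updates to x[j] are serialized
            x[j] -= T[j][i] * x[i]
        tc[j][i].set()

    def stream(i):
        for j in range(i):             # WAIT TC(I,J) RANGE J = 1 TO I-1
            tc[i][j].wait()
        x[i] /= T[i][i]                # SOLVE(I) on a 1x1 diagonal block
        tasks = [threading.Thread(target=matvect, args=(j, i))
                 for j in range(i + 1, n)]
        for t in tasks:                # all MATVECTs of column i in parallel
            t.start()
        for t in tasks:
            t.join()

    streams = [threading.Thread(target=stream, args=(i,)) for i in range(n)]
    for s in streams:
        s.start()
    for s in streams:
        s.join()
    return x


T = [[2, 0, 0],
     [1, 1, 0],
     [3, 2, 4]]
print(parallel_triangular_solve(T, [2, 3, 11]))  # [1.0, 2.0, 1.0]
```

Stream 0 has nothing to wait for and starts immediately; every later stream blocks only on the updates to its own row, which is the behaviour attributed to the RANGE and WAIT statements above.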


Note that the RANGE statement which is the last executable statement of the program initiates parallel execution of NBLOCKS SOLVE's. The RANGE statement on each MATVECT(J,I) initiates parallel execution of a set of MATVECT tasks. The statement

WITH X[I] : T[I,I]
DO EXECUTE SOLVE (I)

requires SOLVE (I) to obtain exclusive access to X[I] and T[I,I] before executing.
Constructing  the  dependency  graph  leads  naturally  to  the  very  simple  program  of 
Figure  7.  The  task  condition  variables  TC(I,J)  are  used  to  implement  the  mutual 
exclusion  relations  between  the  occurrence  of  SOLVE  and  the  occurrences  of 
MATVECT  along  a  row  of  blocks.  This  CSL  program  selects  one  traversal  from  many 
possible  traversals.  This  is  one  of  several  places  where  the  current  CSL  falls  short  of  the 
generality  needed  for  representation  of  the  computation  graphs  described  in  Section  3. 
Fortran  code  for  the  tasks  SOLVE  and  MATVECT  as  supplied  by  Dongarra  and 
Sorensen  is  given  in  Appendix  B.  This  program  has  been  executed  on  the  experimental 
version  of  the  CSL  programming  system  on  the  dual  CDC  170/750  configuration  in  the 
Computation  Center  at  UT-Austin. 

The important element of the graph representation is that it contains no architecturally dependent information. The CSL program likewise contains no architecturally specific elements. The WAIT and WITH constructs and also the
message  passing  elements  which  can  be  used  in  CONSTRUCT  statements  can  be  easily 
implemented  on  shared  memory  or  fixed  topology  architectures. 


